Chromosome-level genome assembly of the yellow-cheek carp Elopichthys bambusa

Yellow-cheek carp (Elopichthys bambusa) is a typical large and ferocious carnivorous fish endemic to East Asia, with high growth rate, nutritional value and economic value. In this study, a chromosome-level genome of yellow-cheek carp was generated by combining PacBio reads, Illumina reads and Hi-C data. The genome size is 827.63 Mb with a scaffold N50 size of 33.65 Mb, and 99.51% (823.61 Mb) of the assembled sequences were anchored to 24 pseudo-chromosomes. The genome is predicted to contain 24,153 protein-coding genes, with 95.54% having functional annotations. Repeat elements account for approximately 55.17% of the genomic landscape. The completeness of yellow-cheek carp genome assembly is highlighted by a BUSCO score of 98.4%. This genome will help us understand the genetic diversity of yellow-cheek carp and facilitate its conservation planning.


Background & Summary
Yellow-cheek carp (Elopichthys bambusa), also known as the "water tiger", belongs to the genus Elopichthys, subfamily Leuciscinae, family Cyprinidae. It is a typical large and ferocious carnivorous fish endemic to East Asia. In China, it is mainly distributed in river systems such as the Yangtze River, Pearl River and Yellow River 1. Yellow-cheek carp lives in the upper layer of rivers and lakes, has strong swimming ability and chases other fish for food. By preying on diseased and weak fish, it helps to control their population size, which is of great significance for maintaining the ecological balance of the water environment 2. Yellow-cheek carp is also an important economic fish, with firm, delicious meat that is rich in high-quality protein, unsaturated fatty acids, minerals and other nutrients 3-5. However, anthropogenic factors such as overfishing, hydrological modification and water pollution have depleted the natural resources of yellow-cheek carp 6,7, which has been listed in the "Key Protected Endangered and Threatened Aquatic Species" and the IUCN Red List of Threatened Species (Version 2020.3) 8.
As a typical carnivore, yellow-cheek carp stands out among East Asian carp species, which are mainly omnivorous or herbivorous. For example, yellow-cheek carp and grass carp both belong to the subfamily Leuciscinae and are closely related, yet they have evolved completely opposite feeding habits 9, which provides excellent material for studying the evolution and genetic regulation of fish feeding habits. However, the lack of genomic information limits research on the mechanism underlying carnivory in yellow-cheek carp. At the same time, high breeding profits have promoted the continuous development of the artificial breeding industry for yellow-cheek carp. Using live or frozen fish as the main bait not only raises breeding costs, but also easily pollutes the aquaculture water, which greatly restricts the expansion of the farming scale 10. Therefore, research on the dietary transformation of typical carnivorous fishes such as yellow-cheek carp has gradually become a hot topic, and there is an urgent need for genetic breeding of yellow-cheek carp based on whole-genome information.
In this study, we combined PacBio long-read sequencing, Illumina short-read sequencing and Hi-C technology to generate a high-quality chromosome-level genome of the yellow-cheek carp (Fig. 1). We expect this resource to accelerate genetic research on yellow-cheek carp and the discovery of functional genes related to its key economic traits. Elucidating the genome structure and function will promote more in-depth research into the genetic basis of important traits such as carnivory, thereby contributing to resource protection, genetic selection and artificial breeding.

Methods
Sample collection and sequencing. An adult male yellow-cheek carp was collected from the Yangtze River in Wuhan, Hubei, China. High-quality genomic DNA was extracted from muscle by the CTAB method for Illumina sequencing, PacBio SMRT sequencing 11 and Hi-C. The quality of the extracted DNA was assessed using agarose gel electrophoresis and a NanoDrop Spectrophotometer (Thermo Fisher Scientific, USA), and the DNA was quantified using a Qubit Fluorometer (Invitrogen, USA).
For Illumina sequencing, the genomic DNA was randomly sheared into 300-500 bp fragments, and a paired-end genomic library was prepared following the manufacturer's protocol. The library was then sequenced on an Illumina NovaSeq platform using a paired-end 150 bp layout to enable genome survey and base-level correction. For PacBio long-read sequencing, SMRTbell libraries were constructed using the genomic DNA and sequenced on the PacBio Sequel II sequencing platform. Approximately 58.98 Gb of Illumina data were obtained.
To generate a chromosome-level assembly of the yellow-cheek carp genome, a Hi-C library was generated using DNA extracted from the same individual. After cell crosslinking, cell lysis, chromatin digestion, biotin labelling, proximal chromatin DNA ligation and DNA purification, the resulting Hi-C library was subjected to paired-end sequencing with 150 bp read lengths on an Illumina NovaSeq platform. In total, 151.98 Gb of Hi-C data were obtained, covering 183.78× of the genome.
To aid genome annotation, total RNA was extracted from muscle, spleen, gonad and skin and tested for purity and integrity using a NanoDrop Spectrophotometer (Thermo Fisher Scientific, USA) and an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). The RNA library was constructed using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina (NEB, USA) following the manufacturer's protocol and sequenced on an Illumina NovaSeq 6000 platform. In total, 23.74 Gb of data were obtained (Table 1).
Genome assembly. First, SOAPnuke (v2.1.0) 12 was used to perform quality control of the Illumina data, and the clean data were used for genome size estimation. K-mer analysis 13 was conducted using GCE (v1.0.2). The genome size was estimated to be 786.16 Mb, with a heterozygosity ratio of 0.47% and a repeat sequence ratio of 47.03% (Table 2). A total of 27.35 Gb of PacBio long-read data were used for de novo genome assembly with MECAT2 (v2.0.0) 14 and NextDenovo (v2.4.0). The assembly was then polished using gcpp (v2.0.2) and Pilon (v1.22) 15. The resulting assembly consists of 170 contigs with a total length of 827.63 Mb (Table 3).
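The idea behind k-mer based genome size estimation can be sketched in a few lines of Python. This is a minimal illustration of the common "total k-mers divided by peak depth" approximation, not the statistical model implemented in GCE; the function name, the `min_depth` error cutoff and the toy histogram are assumptions made for the example.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram: dict mapping k-mer depth -> number of distinct k-mers seen
               at that depth (hypothetical input format).
    min_depth: depths below this are treated as sequencing errors
               (an assumed cutoff, chosen per data set in practice).
    Returns (estimated_size, peak_depth).
    """
    # Discard low-depth k-mers, which are dominated by sequencing errors.
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Take the tallest remaining peak as the expected k-mer depth.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram: an error tail at depths 1-2 and a main peak at depth 20.
toy = {1: 1000, 2: 200, 18: 50, 19: 80, 20: 100, 21: 80, 22: 50, 40: 10}
size, peak = estimate_genome_size(toy)
print(size, peak)
```

In heterozygous genomes the tallest peak may correspond to heterozygous k-mers, so real tools model the peaks explicitly rather than taking a simple maximum.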
Hi-C scaffolding. Hi-C technology was used for chromosome-level genome assembly. Trimmomatic 16 was used with the parameters LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50 to remove adapters and low-quality fragments from the raw Hi-C reads. The processed reads were then aligned to the assembly using Juicer (v1.6) 17 with default settings. Contigs were scaffolded using the 3D-DNA pipeline 18 with all valid Hi-C reads. Juicebox (v2.13.07) 17 was used to adjust the chromosome-scale scaffolds manually (Fig. 2, Table 4). The 24 chromosomes contain 141 gaps.
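The effect of the trimming parameters above on a single read can be illustrated with a simplified Python sketch. It applies the four steps in the listed order to a list of Phred quality scores; the function name is hypothetical, and the cut point is placed at the start of the first failing window, which is a simplification and may differ from Trimmomatic's exact behaviour. Adapter removal is not modelled.

```python
def trim_read(quals, leading=3, trailing=3, window=4, min_avg=15, min_len=50):
    """Return the (start, end) slice of a read to keep, or None if dropped.

    quals: list of per-base Phred quality scores.
    Defaults mirror LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50.
    """
    start, end = 0, len(quals)
    # LEADING: remove 5' bases whose quality is below the threshold.
    while start < end and quals[start] < leading:
        start += 1
    # TRAILING: remove 3' bases whose quality is below the threshold.
    while end > start and quals[end - 1] < trailing:
        end -= 1
    # SLIDINGWINDOW: scan from the 5' end and cut at the first window
    # whose mean quality falls below min_avg (simplified cut position).
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) < min_avg * window:
            end = i
            break
    # MINLEN: discard reads that are too short after trimming.
    if end - start < min_len:
        return None
    return start, end


# Toy read: two poor leading bases, 60 good bases, then a low-quality tail.
quals = [2, 2] + [30] * 60 + [5] * 10 + [2] * 3
print(trim_read(quals))        # slice of the read that is kept
print(trim_read([30] * 40))    # shorter than MINLEN, so dropped
```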
Repeat annotation. We used de novo prediction and homology comparison to annotate the repetitive sequences in the genome. For de novo prediction, RepeatModeler 19 was used to detect and classify the repetitive sequences in the genome assembly, together with RECON (v1.08) 20, RepeatScout (v1.0.5) 21, LTR-FINDER (v1.0.5) 22 and TRF (v4.0.935) 23. For homology comparison, RepeatMasker (open-4.0.9) and RepeatProteinMask (open-4.0.9) were used to identify known TEs in the yellow-cheek carp genome against the Repbase TE library 24,25 and the TE protein database, respectively. Repetitive sequences totalled 456.66 Mb, accounting for 55.17% of the assembled genome. Among the repeat elements, short interspersed nuclear elements (SINEs) accounted for 0.24% of the genome and long interspersed nuclear elements (LINEs) for 7.67%. Long terminal repeats (LTRs) and DNA elements accounted for 12.31% and 34.87%, respectively (Table 5).

Protein-coding gene prediction and annotation. Ab initio gene prediction, homology-based gene prediction and transcript-based prediction were combined to predict the protein-coding genes of the yellow-cheek carp genome. Prior to gene prediction, the assembled genome was hard- and soft-masked using RepeatMasker. Ab initio gene prediction was performed using Augustus (v3.3.1) 26,27 and Genscan (v1.0) 28. The models used for each gene predictor were trained on a set of high-quality proteins generated from the RNA-Seq data. For the homology-based prediction, GlimmerHMM (v3.0.4) 29 was used to align the protein sequences to our genome assembly and predict coding genes with the default parameters. The reference protein sequences of five fish species (Ctenopharyngodon idella, Sinocyclocheilus grahami, Megalobrama amblycephala, Danio rerio and Cyprinus carpio) were sourced from the NCBI database. For the transcript-based prediction, clean RNA-Seq reads were assembled against the yellow-cheek carp genome using StringTie (v2.1.1) 30, and gene structures were then determined using PASA (v2.4.1) 31. To consolidate the results of the three methods, MAKER (v3.00) 32 was used to merge and integrate the gene predictions.
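The difference between hard and soft masking applied before gene prediction can be illustrated with a short Python sketch. It assumes repeat annotations are available as 0-based, half-open (start, end) intervals; the function name and the toy sequence are hypothetical, and in practice RepeatMasker writes the masked sequence itself.

```python
def mask_sequence(seq, repeats, mode="soft"):
    """Mask repeat intervals in a DNA sequence.

    seq:     DNA sequence as an upper-case string.
    repeats: list of (start, end) intervals, 0-based and half-open
             (an assumed representation of the repeat annotation).
    mode:    "soft" lower-cases repeat bases, "hard" replaces them with N.
    """
    if mode not in ("soft", "hard"):
        raise ValueError("mode must be 'soft' or 'hard'")
    chars = list(seq)
    for start, end in repeats:
        # Clamp each interval to the sequence bounds before masking.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if mode == "soft" else "N"
    return "".join(chars)


toy_seq = "ACGTACGTAC"
toy_repeats = [(2, 5)]
print(mask_sequence(toy_seq, toy_repeats, "soft"))  # ACgtaCGTAC
print(mask_sequence(toy_seq, toy_repeats, "hard"))  # ACNNNCGTAC
```

Soft masking keeps the underlying bases available to tools that can use them, whereas hard masking removes the repeat sequence from consideration entirely.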
For functional annotation of the predicted genes, BLASTP (v2.6.0) 33,34 was used to align the predicted genes to the Kyoto Encyclopedia of Genes and Genomes (KEGG) 35, Gene Ontology (GO) 36, NCBI-NR (non-redundant protein database), Swiss-Prot 37, TrEMBL 38 and InterPro 39 databases. In total, we predicted 24,153 protein-coding genes in the genome. These genes had an average coding sequence length of 1,638.21 bp, an average gene length of 18,969.98 bp and an average of 9.87 exons per gene (Table 6). Further, 22,965 genes, accounting for 95.54% of the predicted genes, were assigned at least one functional annotation (Table 7).
Annotation of non-coding RNA genes. tRNAscan-SE (v1.3.1) 40 with default parameters was used to identify tRNA genes. We downloaded rRNA sequences of closely related species from the Ensembl database and aligned them against our genome using BLASTn (v2.6.0) 41 with E-value <1e-5, identity ≥85% and match length ≥50 bp. The miRNAs and snRNAs were identified using Infernal (v1.1.2) 42 against the Rfam (v14.1) database with default parameters. As a result, we annotated 76 rRNAs, 2,469 tRNAs, 291 miRNAs and 212 snRNAs (Table 8).
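The rRNA hit thresholds stated above (E-value <1e-5, identity ≥85%, match length ≥50 bp) can be applied to BLAST tabular output as in the following Python sketch. It assumes the standard 12-column tabular format (`-outfmt 6`), in which columns 3, 4 and 11 hold percent identity, alignment length and E-value; the function name and the example hits are hypothetical.

```python
import csv
import io


def filter_hits(tabular_text, max_evalue=1e-5, min_identity=85.0, min_length=50):
    """Keep BLAST tabular hits that pass the E-value, identity and length cutoffs."""
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        pident = float(row[2])    # column 3: percent identity
        length = int(row[3])      # column 4: alignment length
        evalue = float(row[10])   # column 11: E-value
        if evalue < max_evalue and pident >= min_identity and length >= min_length:
            kept.append(row)
    return kept


# Hypothetical hits: only the first passes all three thresholds.
example = (
    "rRNA_5S\tchr1\t98.3\t119\t2\t0\t1\t119\t1000\t1118\t1e-50\t210\n"
    "rRNA_5S\tchr2\t80.0\t100\t20\t0\t1\t100\t500\t599\t1e-10\t90\n"
    "rRNA_18S\tchr3\t95.0\t40\t2\t0\t1\t40\t10\t49\t1e-8\t60\n"
)
for hit in filter_hits(example):
    print(hit[0], hit[1], hit[2], hit[3], hit[10])
```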

Data Records
All the raw sequencing data have been deposited in the NCBI database under the accession number SRP470306 43. The genome assembly has been deposited at GenBank under the accession GCA_037101425.1 44. Genome annotations, along with predicted coding sequences and protein sequences, can be accessed through Figshare 45.

Fig. 2 Genome-wide Hi-C interaction mapping of chromosome sections.

Table 1.
Statistics of the sequencing data used for genome assembly.

Table 2.
K-mer frequency and genome size evaluation of yellow-cheek carp genome.

Table 3.
Statistics for Hi-C assisted assembly.

Table 4.
Chromosome and reference genome corresponding chromosome statistical results.

Table 5.
Repetitive elements and their proportions in yellow-cheek carp genome.

Table 8.
Statistics of non-coding RNA annotation.

Table 9.
BUSCO evaluation results of the genome assembly.